{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demonstration of the topic coherence pipeline in Gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be using the `u_mass` and `c_v` coherence for two different LDA models: a \"good\" and a \"bad\" LDA model. The good LDA model will be trained over 50 iterations and the bad one for 1 iteration. Hence in theory, the good LDA model will be able come up with better or more human-understandable topics. Therefore the coherence measure output for the good LDA model should be more (better) than that for the bad LDA model. This is because, simply, the good LDA model usually comes up with better topics that are more human interpretable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import logging\n",
    "try:\n",
    "    import pyLDAvis.gensim\n",
    "except ImportError:\n",
    "    ValueError(\"SKIP: please install pyLDAvis\")\n",
    "    \n",
    "import json\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity\n",
    "\n",
    "from gensim.models.coherencemodel import CoherenceModel\n",
    "from gensim.models.ldamodel import LdaModel\n",
    "from gensim.models.hdpmodel import HdpModel\n",
    "from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet\n",
    "from gensim.corpora.dictionary import Dictionary\n",
    "from numpy import array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up logging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "logger = logging.getLogger()\n",
    "logger.setLevel(logging.DEBUG)\n",
    "logging.debug(\"test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As stated in table 2 from [this](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) paper, this corpus essentially has two classes of documents. First five are about human-computer interaction and the other four are about graphs. We will be setting up two LDA models. One with 50 iterations of training and the other with just 1. Hence the one with 50 iterations (\"better\" model) should be able to capture this underlying pattern of the corpus better than the \"bad\" LDA model. Therefore, in theory, our topic coherence for the good LDA model should be greater than the one for the bad LDA model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "texts = [['human', 'interface', 'computer'],\n",
    "         ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
    "         ['eps', 'user', 'interface', 'system'],\n",
    "         ['system', 'human', 'system', 'eps'],\n",
    "         ['user', 'response', 'time'],\n",
    "         ['trees'],\n",
    "         ['graph', 'trees'],\n",
    "         ['graph', 'minors', 'trees'],\n",
    "         ['graph', 'minors', 'survey']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dictionary = Dictionary(texts)\n",
    "corpus = [dictionary.doc2bow(text) for text in texts]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up two topic models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll be setting up two different LDA Topic models. A good one and bad one. To build a \"good\" topic model, we'll simply train it using more iterations than the bad one. Therefore the `u_mass` coherence should in theory be better for the good model than the bad one since it would be producing more \"human-interpretable\" topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=50, num_topics=2)\n",
    "badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using U_Mass Coherence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View the pipeline parameters for one coherence model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following are the pipeline parameters for `u_mass` coherence. By pipeline parameters, we mean the functions being used to calculate segmentation, probability estimation, confirmation measure and aggregation as shown in figure 1 in [this](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CoherenceModel(segmentation=<function s_one_pre at 0x7f663ae82f50>, probability estimation=<function p_boolean_document at 0x7f663ae8a2a8>, confirmation measure=<function log_conditional_probability at 0x7f663ae8a668>, aggregation=<function arithmetic_mean at 0x7f663ae8aa28>)\n"
     ]
    }
   ],
   "source": [
    "print goodcm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpreting the topics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we will see below using LDA visualization, the better model comes up with two topics composed of the following words:\n",
    "1. goodLdaModel:\n",
    "    - __Topic 1__: More weightage assigned to words such as \"system\", \"user\", \"eps\", \"interface\" etc which captures the first set of documents.\n",
    "    - __Topic 2__: More weightage assigned to words such as \"graph\", \"trees\", \"survey\" which captures the topic in the second set of documents.\n",
    "2. badLdaModel:\n",
    "    - __Topic 1__: More weightage assigned to words such as \"system\", \"user\", \"trees\", \"graph\" which doesn't make the topic clear enough.\n",
    "    - __Topic 2__: More weightage assigned to words such as \"system\", \"trees\", \"graph\", \"user\" which is similar to the first topic. Hence both topics are not human-interpretable.\n",
    "\n",
    "Therefore, the topic coherence for the goodLdaModel should be greater for this than the badLdaModel since the topics it comes up with are more human-interpretable. We will see this using `u_mass` and `c_v` topic coherence measures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize topic models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pyLDAvis.enable_notebook()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<link rel=\"stylesheet\" type=\"text/css\" href=\"https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.css\">\n",
       "\n",
       "\n",
       "<div id=\"ldavis_el97211400780301404969092186121\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "\n",
       "var ldavis_el97211400780301404969092186121_data = {\"plot.opts\": {\"xlab\": \"PC1\", \"ylab\": \"PC2\"}, \"topic.order\": [2, 1], \"token.table\": {\"Topic\": [1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2], \"Freq\": [0.47860298324896949, 0.47860298324896949, 0.91234237411332397, 0.36742593082724595, 0.7348518616544919, 0.46586716704724729, 0.46586716704724729, 0.92768662134746982, 0.46384331067373491, 0.47951599225096281, 0.47951599225096281, 0.47068971991734737, 0.47068971991734737, 0.48713296080836122, 0.48713296080836122, 0.82644395095404677, 0.27548131698468226, 0.47093723669544113, 0.47093723669544113, 0.35893273744053222, 0.35893273744053222, 0.69154485611401006, 0.34577242805700503], \"Term\": [\"computer\", \"computer\", \"eps\", \"graph\", \"graph\", \"human\", \"human\", \"interface\", \"interface\", \"minors\", \"minors\", \"response\", \"response\", \"survey\", \"survey\", \"system\", \"system\", \"time\", \"time\", \"trees\", \"trees\", \"user\", \"user\"]}, \"mdsDat\": {\"y\": [-0.0, -0.0], \"cluster\": [1, 1], \"Freq\": [60.467873822949159, 39.532126177050834], \"topics\": [1, 2], \"x\": [-0.021780138119695133, 0.021780138119695133]}, \"R\": 12, \"lambda.step\": 0.01, \"tinfo\": {\"Category\": [\"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\"], \"Term\": [\"graph\", \"survey\", \"trees\", \"minors\", \"computer\", \"eps\", \"time\", \"response\", \"system\", \"user\", \"human\", \"interface\", \"eps\", \"system\", \"user\", \"interface\", \"human\", \"response\", \"time\", \"trees\", \"computer\", \"minors\", \"survey\", \"graph\", \"graph\", \"survey\", \"minors\", \"computer\", \"trees\", \"time\", \"response\", \"human\", \"interface\", \"user\", \"system\", \"eps\"], \"loglift\": [12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.28039999999999998, 0.23119999999999999, 0.19839999999999999, 0.1477, 0.1095, 0.012200000000000001, 0.0070000000000000001, -0.1706, -0.17130000000000001, -0.1948, -0.41599999999999998, -0.51039999999999996, 0.47710000000000002, 0.41909999999999997, 0.23960000000000001, 0.2157, 0.215, -0.010800000000000001, -0.019, -0.19489999999999999, -0.27900000000000003, -0.40910000000000002, -0.50729999999999997, -0.6835], \"Freq\": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.75465564679149, 2.7659900500989569, 2.1326462675558417, 1.5111201940472374, 1.4482141989675958, 1.3004987462049202, 1.2929988949829556, 1.4204356927347734, 1.0645643538992473, 1.0378439934863242, 0.81882725859319405, 0.98788811129271803, 1.7337488666650895, 1.2340003684689818, 1.0475921883305965, 1.0248501190560948, 1.3656016220300282, 0.83042631347366114, 0.82404308609938193, 0.69832041590596561, 0.64478025118607762, 0.75942932568705601, 0.86402018415443749, 0.43750385028737271], \"Total\": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.1921594970788627, 3.6300102342533944, 2.8920755932428976, 2.155900445233315, 2.1465346148735613, 2.1245418323043022, 2.1234252084566165, 2.7860373147648017, 2.0894144729553421, 2.0854361818169207, 2.0528276270621757, 2.7216369779578073, 2.7216369779578073, 2.0528276270621757, 2.0854361818169207, 2.0894144729553421, 2.7860373147648017, 2.1234252084566165, 2.1245418323043022, 2.1465346148735613, 2.155900445233315, 2.8920755932428976, 3.6300102342533944, 2.1921594970788627], \"logprob\": [12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -2.302, -1.8468, -2.1069, -2.4514, -2.4939, -2.6015000000000001, -2.6073, -2.5133000000000001, -2.8016999999999999, -2.8271000000000002, -3.0640999999999998, -2.8763999999999998, -1.889, -2.2290000000000001, -2.3927, -2.4146999999999998, -2.1276000000000002, -2.6251000000000002, -2.6328, -2.7982999999999998, -2.8780999999999999, -2.7143999999999999, -2.5853999999999999, -3.2658999999999998]}};\n",
       "\n",
       "function LDAvis_load_lib(url, callback){\n",
       "  var s = document.createElement('script');\n",
       "  s.src = url;\n",
       "  s.async = true;\n",
       "  s.onreadystatechange = s.onload = callback;\n",
       "  s.onerror = function(){console.warn(\"failed to load library \" + url);};\n",
       "  document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "}\n",
       "\n",
       "if(typeof(LDAvis) !== \"undefined\"){\n",
       "   // already loaded: just create the visualization\n",
       "   !function(LDAvis){\n",
       "       new LDAvis(\"#\" + \"ldavis_el97211400780301404969092186121\", ldavis_el97211400780301404969092186121_data);\n",
       "   }(LDAvis);\n",
       "}else if(typeof define === \"function\" && define.amd){\n",
       "   // require.js is available: use it to load d3/LDAvis\n",
       "   require.config({paths: {d3: \"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\"}});\n",
       "   require([\"d3\"], function(d3){\n",
       "      window.d3 = d3;\n",
       "      LDAvis_load_lib(\"https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.js\", function(){\n",
       "        new LDAvis(\"#\" + \"ldavis_el97211400780301404969092186121\", ldavis_el97211400780301404969092186121_data);\n",
       "      });\n",
       "    });\n",
       "}else{\n",
       "    // require.js not available: dynamically load d3 & LDAvis\n",
       "    LDAvis_load_lib(\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js\", function(){\n",
       "         LDAvis_load_lib(\"https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.js\", function(){\n",
       "                 new LDAvis(\"#\" + \"ldavis_el97211400780301404969092186121\", ldavis_el97211400780301404969092186121_data);\n",
       "            })\n",
       "         });\n",
       "}\n",
       "</script>"
      ],
      "text/plain": [
       "PreparedData(topic_coordinates=            Freq  cluster  topics        x    y\n",
       "topic                                          \n",
       "1      60.467874        1       1 -0.02178 -0.0\n",
       "0      39.532126        1       2  0.02178 -0.0, topic_info=     Category      Freq       Term     Total  loglift  logprob\n",
       "term                                                          \n",
       "1     Default  2.000000      graph  2.000000  12.0000  12.0000\n",
       "6     Default  2.000000     survey  2.000000  11.0000  11.0000\n",
       "3     Default  2.000000      trees  2.000000  10.0000  10.0000\n",
       "0     Default  2.000000     minors  2.000000   9.0000   9.0000\n",
       "5     Default  2.000000   computer  2.000000   8.0000   8.0000\n",
       "4     Default  2.000000        eps  2.000000   7.0000   7.0000\n",
       "9     Default  2.000000       time  2.000000   6.0000   6.0000\n",
       "11    Default  2.000000   response  2.000000   5.0000   5.0000\n",
       "2     Default  3.000000     system  3.000000   4.0000   4.0000\n",
       "7     Default  2.000000       user  2.000000   3.0000   3.0000\n",
       "8     Default  2.000000      human  2.000000   2.0000   2.0000\n",
       "10    Default  2.000000  interface  2.000000   1.0000   1.0000\n",
       "4      Topic1  1.754656        eps  2.192159   0.2804  -2.3020\n",
       "2      Topic1  2.765990     system  3.630010   0.2312  -1.8468\n",
       "7      Topic1  2.132646       user  2.892076   0.1984  -2.1069\n",
       "10     Topic1  1.511120  interface  2.155900   0.1477  -2.4514\n",
       "8      Topic1  1.448214      human  2.146535   0.1095  -2.4939\n",
       "11     Topic1  1.300499   response  2.124542   0.0122  -2.6015\n",
       "9      Topic1  1.292999       time  2.123425   0.0070  -2.6073\n",
       "3      Topic1  1.420436      trees  2.786037  -0.1706  -2.5133\n",
       "5      Topic1  1.064564   computer  2.089414  -0.1713  -2.8017\n",
       "0      Topic1  1.037844     minors  2.085436  -0.1948  -2.8271\n",
       "6      Topic1  0.818827     survey  2.052828  -0.4160  -3.0641\n",
       "1      Topic1  0.987888      graph  2.721637  -0.5104  -2.8764\n",
       "1      Topic2  1.733749      graph  2.721637   0.4771  -1.8890\n",
       "6      Topic2  1.234000     survey  2.052828   0.4191  -2.2290\n",
       "0      Topic2  1.047592     minors  2.085436   0.2396  -2.3927\n",
       "5      Topic2  1.024850   computer  2.089414   0.2157  -2.4147\n",
       "3      Topic2  1.365602      trees  2.786037   0.2150  -2.1276\n",
       "9      Topic2  0.830426       time  2.123425  -0.0108  -2.6251\n",
       "11     Topic2  0.824043   response  2.124542  -0.0190  -2.6328\n",
       "8      Topic2  0.698320      human  2.146535  -0.1949  -2.7983\n",
       "10     Topic2  0.644780  interface  2.155900  -0.2790  -2.8781\n",
       "7      Topic2  0.759429       user  2.892076  -0.4091  -2.7144\n",
       "2      Topic2  0.864020     system  3.630010  -0.5073  -2.5854\n",
       "4      Topic2  0.437504        eps  2.192159  -0.6835  -3.2659, token_table=      Topic      Freq       Term\n",
       "term                            \n",
       "5         1  0.478603   computer\n",
       "5         2  0.478603   computer\n",
       "4         1  0.912342        eps\n",
       "1         1  0.367426      graph\n",
       "1         2  0.734852      graph\n",
       "8         1  0.465867      human\n",
       "8         2  0.465867      human\n",
       "10        1  0.927687  interface\n",
       "10        2  0.463843  interface\n",
       "0         1  0.479516     minors\n",
       "0         2  0.479516     minors\n",
       "11        1  0.470690   response\n",
       "11        2  0.470690   response\n",
       "6         1  0.487133     survey\n",
       "6         2  0.487133     survey\n",
       "2         1  0.826444     system\n",
       "2         2  0.275481     system\n",
       "9         1  0.470937       time\n",
       "9         2  0.470937       time\n",
       "3         1  0.358933      trees\n",
       "3         2  0.358933      trees\n",
       "7         1  0.691545       user\n",
       "7         2  0.345772       user, R=12, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[2, 1])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<link rel=\"stylesheet\" type=\"text/css\" href=\"https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.css\">\n",
       "\n",
       "\n",
       "<div id=\"ldavis_el97211400770039304485557333450\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "\n",
       "var ldavis_el97211400770039304485557333450_data = {\"plot.opts\": {\"xlab\": \"PC1\", \"ylab\": \"PC2\"}, \"topic.order\": [2, 1], \"token.table\": {\"Topic\": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2], \"Freq\": [0.47136770689304686, 0.47136770689304686, 0.4714363999386586, 0.4714363999386586, 0.35354620598847575, 0.35354620598847575, 0.47153029671607571, 0.47153029671607571, 0.47123934722812777, 0.47123934722812777, 0.47117011634775935, 0.47117011634775935, 0.47121186304124502, 0.47121186304124502, 0.47112117283273591, 0.47112117283273591, 0.56545462979830119, 0.56545462979830119, 0.47101058417699188, 0.47101058417699188, 0.3534540666147919, 0.3534540666147919, 0.70681839866719087, 0.35340919933359544], \"Term\": [\"computer\", \"computer\", \"eps\", \"eps\", \"graph\", \"graph\", \"human\", \"human\", \"interface\", \"interface\", \"minors\", \"minors\", \"response\", \"response\", \"survey\", \"survey\", \"system\", \"system\", \"time\", \"time\", \"trees\", \"trees\", \"user\", \"user\"]}, \"mdsDat\": {\"y\": [-0.0, -0.0], \"cluster\": [1, 1], \"Freq\": [52.514670987555313, 47.48532901244468], \"topics\": [1, 2], \"x\": [-0.0024545927918686247, 0.0024545927918686247]}, \"R\": 12, \"lambda.step\": 0.01, \"tinfo\": {\"Category\": [\"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Default\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic1\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\", \"Topic2\"], \"Term\": [\"human\", \"eps\", \"graph\", \"time\", \"computer\", \"trees\", \"survey\", \"interface\", \"minors\", \"system\", \"user\", \"response\", \"time\", \"survey\", \"minors\", \"response\", \"system\", \"user\", \"interface\", \"trees\", \"computer\", \"graph\", \"eps\", \"human\", \"human\", \"eps\", \"graph\", \"computer\", \"trees\", \"interface\", \"user\", \"system\", \"response\", \"minors\", \"survey\", \"time\"], \"loglift\": [12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.16569999999999999, 0.0969, 0.064799999999999996, 0.036600000000000001, 0.036400000000000002, 0.036299999999999999, 0.017600000000000001, -0.0054999999999999997, -0.076200000000000004, -0.097299999999999998, -0.1303, -0.20930000000000001, 0.18970000000000001, 0.12670000000000001, 0.097600000000000006, 0.078, 0.0060000000000000001, -0.019800000000000002, -0.041799999999999997, -0.041799999999999997, -0.042099999999999999, -0.076899999999999996, -0.1193, -0.2223], \"Freq\": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 2.0, 1.3159068896432811, 1.2280436128475054, 1.1891708953915239, 1.1560205364187532, 1.9262662286104497, 1.5409343444354711, 1.1341990044178432, 1.4776087862735434, 1.0323193862239972, 1.347613921157538, 0.9778202097248585, 0.90335077124627938, 1.2174037316232018, 1.1433566873711567, 1.4808706815950541, 1.0891666328148497, 1.351613153588417, 0.98786488070144896, 1.2886467810294173, 1.6107107357720052, 0.96616712147658812, 0.93320479292977043, 0.89455256339149813, 0.80718765131554704], \"Total\": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.123094540958828, 2.1225961762390035, 2.1223756883212941, 2.1221876578953411, 3.5369769643824549, 2.8295811254648884, 2.1220638851192923, 2.8292219398619602, 2.1214860190388469, 2.8284846027525923, 2.1211768970960154, 2.120754502869481, 2.120754502869481, 2.1211768970960154, 2.8284846027525923, 2.1214860190388469, 2.8292219398619602, 2.1220638851192923, 2.8295811254648884, 3.5369769643824549, 2.1221876578953411, 2.1223756883212941, 2.1225961762390035, 2.123094540958828], \"logprob\": [12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -2.4487000000000001, -2.5177999999999998, -2.5499999999999998, -2.5781999999999998, -2.0676000000000001, -2.2907999999999999, -2.5973000000000002, -2.3328000000000002, -2.6913999999999998, -2.4249000000000001, -2.7456, -2.8249, -2.4258000000000002, -2.4885999999999999, -2.2299000000000002, -2.5371000000000001, -2.3212000000000002, -2.6347999999999998, -2.3690000000000002, -2.1459000000000001, -2.657, -2.6917, -2.734, -2.8367]}};\n",
       "\n",
       "function LDAvis_load_lib(url, callback){\n",
       "  var s = document.createElement('script');\n",
       "  s.src = url;\n",
       "  s.async = true;\n",
       "  s.onreadystatechange = s.onload = callback;\n",
       "  s.onerror = function(){console.warn(\"failed to load library \" + url);};\n",
       "  document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "}\n",
       "\n",
       "if(typeof(LDAvis) !== \"undefined\"){\n",
       "   // already loaded: just create the visualization\n",
       "   !function(LDAvis){\n",
       "       new LDAvis(\"#\" + \"ldavis_el97211400770039304485557333450\", ldavis_el97211400770039304485557333450_data);\n",
       "   }(LDAvis);\n",
       "}else if(typeof define === \"function\" && define.amd){\n",
       "   // require.js is available: use it to load d3/LDAvis\n",
       "   require.config({paths: {d3: \"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\"}});\n",
       "   require([\"d3\"], function(d3){\n",
       "      window.d3 = d3;\n",
       "      LDAvis_load_lib(\"https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.js\", function(){\n",
       "        new LDAvis(\"#\" + \"ldavis_el97211400770039304485557333450\", ldavis_el97211400770039304485557333450_data);\n",
       "      });\n",
       "    });\n",
       "}else{\n",
       "    // require.js not available: dynamically load d3 & LDAvis\n",
       "    LDAvis_load_lib(\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js\", function(){\n",
       "         LDAvis_load_lib(\"https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.js\", function(){\n",
       "                 new LDAvis(\"#\" + \"ldavis_el97211400770039304485557333450\", ldavis_el97211400770039304485557333450_data);\n",
       "            })\n",
       "         });\n",
       "}\n",
       "</script>"
      ],
      "text/plain": [
       "PreparedData(topic_coordinates=            Freq  cluster  topics         x    y\n",
       "topic                                           \n",
       "1      52.514671        1       1 -0.002455 -0.0\n",
       "0      47.485329        1       2  0.002455 -0.0, topic_info=     Category      Freq       Term     Total  loglift  logprob\n",
       "term                                                          \n",
       "8     Default  2.000000      human  2.000000  12.0000  12.0000\n",
       "4     Default  2.000000        eps  2.000000  11.0000  11.0000\n",
       "1     Default  2.000000      graph  2.000000  10.0000  10.0000\n",
       "9     Default  2.000000       time  2.000000   9.0000   9.0000\n",
       "5     Default  2.000000   computer  2.000000   8.0000   8.0000\n",
       "3     Default  2.000000      trees  2.000000   7.0000   7.0000\n",
       "6     Default  2.000000     survey  2.000000   6.0000   6.0000\n",
       "10    Default  2.000000  interface  2.000000   5.0000   5.0000\n",
       "0     Default  2.000000     minors  2.000000   4.0000   4.0000\n",
       "2     Default  3.000000     system  3.000000   3.0000   3.0000\n",
       "7     Default  2.000000       user  2.000000   2.0000   2.0000\n",
       "11    Default  2.000000   response  2.000000   1.0000   1.0000\n",
       "9      Topic1  1.315907       time  2.123095   0.1657  -2.4487\n",
       "6      Topic1  1.228044     survey  2.122596   0.0969  -2.5178\n",
       "0      Topic1  1.189171     minors  2.122376   0.0648  -2.5500\n",
       "11     Topic1  1.156021   response  2.122188   0.0366  -2.5782\n",
       "2      Topic1  1.926266     system  3.536977   0.0364  -2.0676\n",
       "7      Topic1  1.540934       user  2.829581   0.0363  -2.2908\n",
       "10     Topic1  1.134199  interface  2.122064   0.0176  -2.5973\n",
       "3      Topic1  1.477609      trees  2.829222  -0.0055  -2.3328\n",
       "5      Topic1  1.032319   computer  2.121486  -0.0762  -2.6914\n",
       "1      Topic1  1.347614      graph  2.828485  -0.0973  -2.4249\n",
       "4      Topic1  0.977820        eps  2.121177  -0.1303  -2.7456\n",
       "8      Topic1  0.903351      human  2.120755  -0.2093  -2.8249\n",
       "8      Topic2  1.217404      human  2.120755   0.1897  -2.4258\n",
       "4      Topic2  1.143357        eps  2.121177   0.1267  -2.4886\n",
       "1      Topic2  1.480871      graph  2.828485   0.0976  -2.2299\n",
       "5      Topic2  1.089167   computer  2.121486   0.0780  -2.5371\n",
       "3      Topic2  1.351613      trees  2.829222   0.0060  -2.3212\n",
       "10     Topic2  0.987865  interface  2.122064  -0.0198  -2.6348\n",
       "7      Topic2  1.288647       user  2.829581  -0.0418  -2.3690\n",
       "2      Topic2  1.610711     system  3.536977  -0.0418  -2.1459\n",
       "11     Topic2  0.966167   response  2.122188  -0.0421  -2.6570\n",
       "0      Topic2  0.933205     minors  2.122376  -0.0769  -2.6917\n",
       "6      Topic2  0.894553     survey  2.122596  -0.1193  -2.7340\n",
       "9      Topic2  0.807188       time  2.123095  -0.2223  -2.8367, token_table=      Topic      Freq       Term\n",
       "term                            \n",
       "5         1  0.471368   computer\n",
       "5         2  0.471368   computer\n",
       "4         1  0.471436        eps\n",
       "4         2  0.471436        eps\n",
       "1         1  0.353546      graph\n",
       "1         2  0.353546      graph\n",
       "8         1  0.471530      human\n",
       "8         2  0.471530      human\n",
       "10        1  0.471239  interface\n",
       "10        2  0.471239  interface\n",
       "0         1  0.471170     minors\n",
       "0         2  0.471170     minors\n",
       "11        1  0.471212   response\n",
       "11        2  0.471212   response\n",
       "6         1  0.471121     survey\n",
       "6         2  0.471121     survey\n",
       "2         1  0.565455     system\n",
       "2         2  0.565455     system\n",
       "9         1  0.471011       time\n",
       "9         2  0.471011       time\n",
       "3         1  0.353454      trees\n",
       "3         2  0.353454      trees\n",
       "7         1  0.706818       user\n",
       "7         2  0.353409       user, R=12, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[2, 1])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pyLDAvis.gensim.prepare(badLdaModel, corpus, dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-14.0842451581\n"
     ]
    }
   ],
   "source": [
    "print goodcm.get_coherence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-14.4434307511\n"
     ]
    }
   ],
   "source": [
    "print badcm.get_coherence()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using C_V coherence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pipeline parameters for C_V coherence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CoherenceModel(segmentation=<function s_one_set at 0x7f663ae8a050>, probability estimation=<function p_boolean_sliding_window at 0x7f663ae8a320>, confirmation measure=<function cosine_similarity at 0x7f663ae8a938>, aggregation=<function arithmetic_mean at 0x7f663ae8aa28>)\n"
     ]
    }
   ],
   "source": [
    "print goodcm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Print coherence values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.552164532134\n"
     ]
    }
   ],
   "source": [
    "print goodcm.get_coherence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.5269189184\n"
     ]
    }
   ],
   "source": [
    "print badcm.get_coherence()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Support for wrappers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This API supports gensim's _ldavowpalwabbit_ and _ldamallet_ wrappers as input parameter to `model`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "model1 = LdaVowpalWabbit('/home/devashish/vw-8', corpus=corpus, num_topics=2, id2word=dictionary, passes=50)\n",
    "model2 = LdaVowpalWabbit('/home/devashish/vw-8', corpus=corpus, num_topics=2, id2word=dictionary, passes=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cm1 = CoherenceModel(model=model1, corpus=corpus, coherence='u_mass')\n",
    "cm2 = CoherenceModel(model=model2, corpus=corpus, coherence='u_mass')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-14.075813889\n",
      "-15.1740896045\n"
     ]
    }
   ],
   "source": [
    "print cm1.get_coherence()\n",
    "print cm2.get_coherence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "model1 = LdaMallet('/home/devashish/mallet-2.0.8RC3/bin/mallet',corpus=corpus , num_topics=2, id2word=dictionary, iterations=50)\n",
    "model2 = LdaMallet('/home/devashish/mallet-2.0.8RC3/bin/mallet',corpus=corpus , num_topics=2, id2word=dictionary, iterations=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cm1 = CoherenceModel(model=model1, texts=texts, coherence='c_v')\n",
    "cm2 = CoherenceModel(model=model2, texts=texts, coherence='c_v')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.581114877802\n",
      "0.549865328265\n"
     ]
    }
   ],
   "source": [
    "print cm1.get_coherence()\n",
    "print cm2.get_coherence()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Support for other topic models\n",
    "The gensim topics coherence pipeline can be used with other topics models too. Only the tokenized `topics` should be made available for the pipeline. Eg. with the gensim HDP model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "hm = HdpModel(corpus=corpus, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# To get the topic words from the model\n",
    "topics = []\n",
    "for topic_id, topic in hm.show_topics(num_topics=10, formatted=False):\n",
    "    topic = [word for word, _ in topic]\n",
    "    topics.append(topic)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[u'minors',\n",
       "  u'system',\n",
       "  u'graph',\n",
       "  u'human',\n",
       "  u'interface',\n",
       "  u'eps',\n",
       "  u'trees',\n",
       "  u'computer',\n",
       "  u'user',\n",
       "  u'response',\n",
       "  u'survey',\n",
       "  u'time'],\n",
       " [u'minors',\n",
       "  u'trees',\n",
       "  u'time',\n",
       "  u'interface',\n",
       "  u'user',\n",
       "  u'survey',\n",
       "  u'system',\n",
       "  u'response',\n",
       "  u'human',\n",
       "  u'computer',\n",
       "  u'graph',\n",
       "  u'eps']]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "topics[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Initialize CoherenceModel using `topics` parameter\n",
    "cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-14.640667699204982"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cm.get_coherence()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hence as we can see, the `u_mass` and `c_v` coherence for the good LDA model is much more (better) than that for the bad LDA model. This is because, simply, the good LDA model usually comes up with better topics that are more human interpretable. The badLdaModel however fails to decipher between these two topics and comes up with topics which are not clear to a human. The `u_mass` and `c_v` topic coherences capture this wonderfully by giving the interpretability of these topics a number as we can see above. Hence this coherence measure can be used to compare difference topic models based on their human-interpretability."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
